Can language models preserve their own alignment?
I wouldn’t consider this an argument for that claim; rather, it is a project proposal to empirically test how much models remain “good.”